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Abstract 



We introduce a data-driven order reduction method for nonlinear control systems, drawing on recent progress 
in machine learning and statistical dimensionality reduction. The method rests on the assumption that the nonlinear 
system behaves linearly when lifted into a high (or infinite) dimensional feature space where balanced truncation 
may be carried out implicitly. This leads to a nonlinear reduction map which can be combined with a representation 
of the system belonging to a reproducing kernel Hilbert space to give a closed, reduced order dynamical system 
which captures the essential input-output characteristics of the original model. Empirical simulations illustrating 
the approach are also provided. 



Model reduction of controlled dynamical systems has been a long standing, and as yet, unsettled 
challenge in control theory. The benefits are clear: a low dimensional approximation of a high dimensional 
system can be manipulated with a simpler controller, and can be simulated at lower computational cost. 
A complex, high dimensional system may even be replaced by a simpler model all together leading to 
significant cost savings, as in circuit design, while the "important variables" of a system might shed light 
on underlying physical or biological processes. Reduction of linear dynamical systems has been treated 
with some success to date. As we describe in more detail below, model reduction in the linear case 
proceeds by reducing the dimension of the system with an eye towards preserving its essential input- 
output behavior, a notion directly related to "balancing" observability and controllability of the system. 
The nonlinear picture, however, is considerably more involved. 

In this paper we propose a scheme for balanced model-order reduction of general, nonlinear control 
systems. A key, and to our knowledge, novel point of departure from the literature on nonlinear model 
reduction is that our approach marries approximation and dimensionality reduction methods known to 
the machine learning and statistics communities with existing ideas in linear and nonlinear control. 
In particular, we apply a method similar to kernel PCA as well as function learning in Reproducing 
Kernel Hilbert Spaces (RKHS) to the problem of balanced model reduction. Working in RKHS provides 
a convenient, general functional-analytical framework for theoretical understanding as well as a ready 
source of existing results and error estimates. The approach presented here is also strongly empirical, 
in that observability and controllability, and in some cases the dynamics of the nonlinear system are 
estimated from simulated or measured trajectories. This emphasis on the empirical makes our approach 
broadly applicable, as the method can be applied without having to tailor anything to the particular form 
of the dynamics. 

The approach we propose begins by constructing empirical estimates of the observability and controlla- 
bility Gramians in a high (or possibly infinite) dimensional feature space. The Gramians are simultaneously 
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diagonalized in order to identify directions which, in the feature space, are both the most observable and the 
most controllable. The assumption that a nonlinear system behaves linearly when lifted to a feature space 
is far more reasonable than assuming linearity in the original space, and then carrying out the linear theory 
hoping for the best. Working in the high dimensional feature space allows one to perform linear operations 
on a representation of the system's state and output which can capture strong nonlinearities. Therefore a 
system which is not model reducible using existing methods, may become reducible when mapped into 
such a nonlinear feature space. This situation closely parallels the problem of linear separability in data 
classification: A dataset which is not linearly separable might be easily separated when mapped into a 
nonlinear feature space. The decision boundary is linear in this feature space, but is nonlinear in the 
original data space. 

Nonlinear reduction of the state space already opens the door to the design of simpler controllers, but 
is only half of the picture. One would also like to be able to write a closed, reduced dynamical system 
whose input-output behavior closely captures that of the original system. This problem is the focus of the 
second half of our paper, where we again exploit helpful properties of RKHS in order to provide such a 
closed system. 

The paper is organized as follows. In the next section we provide the relevant background for model 
reduction and balancing. We then adapt and extend balancing techniques described in the background to 
the current RKHS setting in Section III Section IV then proposes a method for determining a closed, 
reduced nonlinear control system in light of the reduction map described in Section III Finally, Section [V] 
provides experiments illustrating an application of the proposed methods to a specific nonlinear system. 



II. Background 

Several approaches have been proposed for the reduction of linear control systems in view of control, 
but few exist for finite or infinite-dimensional controlled nonlinear dynamical systems. For linear systems 
the pioneering "Input- Output balancing" approach proposed by B.C. Moore observes that the important 
states are the ones that are both easy to reach and that generate a lot of energy at the output. If a large 
amount of energy is required to reach a certain state but the same state yields a small output energy, the 
state is unimportant for the input-output behavior of the system. The goal is then to find the states that are 
both the most controllable and the most observable. One way to determine such states is to find a change 
of coordinates where the controllability and observability Gramians (which can be viewed as a measure 
of the controllability and the observability of the system) are equal and diagonal. States that are difficult 
to reach and that don't significantly affect the output are then ignored or truncated. A system expressed in 
the coordinates where each state is equally controllable and observable is called its balanced realization. 

A proposal for generalizing this approach to nonlinear control systems was advanced by J. Scherpen [|24|. 
where suitably defined controllability and observability energy functions reduce to Gramians in the linear 
case. In general, to find the balanced realization of a system one needs to solve a set of Hamilton-Jacobi and 
Lyapunov equations (as we will discuss below). Moore [fT9l proposed an alternative, data-based approach 
for balancing in the linear case. This method uses samples of the impulse response of a linear system to 
construct empirical controllability and observability Gramians which are then balanced and truncated using 
Principal Components Analysis (PCA, or POD). This data-driven strategy was then extended to nonlinear 
control systems with a stable linear approximation by Lall et al. lfT5ll . by effectively applying Moore's 
method to a nonlinear system by way of the Galerkin projection. Despite the fact that the balancing theory 
underpinning their approach assumes a linear system, Lall and colleagues were able to effectively reduce 
some nonlinear systems. 

Phillips et al. [|22ll has also studied reduction of nonlinear circuit models in the case of linear but 
unbalanced coordinate transformations and found that approximation using a polynomial RKHS could 
afford computational advantages. Gray and Verriest mention in [8|| that studying algebraically defined 
Gramian operators in RKHS may provide advantageous approximation properties, though the idea is not 
further explored. Finally, Coifman et al. flU discuss reduction of an uncontrolled stochastic Langevin 
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system. There, eigenfunctions of a combinatorial Laplacian, built from samples of trajectories, provide 
a set of reduction coordinates but does not provide a reduced system. This method is related to kernel 
principal components (KPCA) using a Gaussian kernel, however reduction in this study is carried out on 
a simplified linear system outside the context of control. 

In the following section we review balancing of linear and nonlinear systems as introduced in [[T9l 
and 11241 . 



A. Balancing of Linear Systems 
Consider a linear control system 



x = Fx + Gu, m 
y = Hx, ' { } 



where (F, G) is controllable, (F, H) is observable and F is Hurwitz. We define the controllability and 
the observability Gramians as, respectively, 

W c = j™e Ft GG T e FTt dt, 
W = f™e FTt H T He Ft dt. 

These two matrices can be viewed as a measure of the controllability and the observability of the 
system [fT9ll . For instance, consider the past energy [24], L c (xq), defined as the minimal energy required 
to reach x from in infinite time 

L c (x )= inf \ f \\u(t)\\ 2 dt, (2) 

x(—co)=0,x(0)=xq 

and the future energy ll24l . L o (x ), defined as the output energy generated by releasing the system from 
its initial state x(to) = xq, and zero input u(t) = for t > 0, i.e. 



CO 



Lo{x,) = \ J \\y(t)\\ 2 dt, (3) 

for a; (to) = an d u(t) = 0,t > 0. In the linear case, it can be shown that L c (x ) = ^XqW^Xq, 
and L {xq) = \xIW q xq. The columns of W c span the controllable subspace while the nullspace of 
W coincides with the unobservable subspace. As such, W c and W (or their estimates) are the key 
ingredients in many model reduction techniques. It is also well known that W c and W satisfy the Lyapunov 
equations [fT9l 

FW C + W C F T = -GG T , 
F T W + W F = —H T H. 

Several methods have been developed to solve these equations directly llT6l . ifTTl . 

The idea behind balancing is to find a representation where the system's observable and controllable 
subspaces are aligned so that reduction, if possible, consists of eliminating uncontrollable states which 
are also the least observable. More formally, we would like to find a new coordinate system such that 

W c = W = S = diag{oi, • • • , a n }, 

where o\ > cr 2 > ■ ■ ■ > cr n > 0. If (F, G) is controllable and (F, H) is observable, then there exists a 
transformation such that the state space expressed in the transformed coordinates (TFT -1 ,TG, HT^ 1 ) 
is balanced and TW C T T = T~^W Q T~ X = E. Typically one looks for a gap in the singular values {o-j} for 
guidance as to where truncation should occur. If we see that there is a k such that 3> (Jk+i, then the 
states most responsible for governing the input-output relationship of the system are (xi, • • • ,Xk) while 
(xk+i, ■ ■ ■ ,x n ) are assumed to make negligible contributions. 

If A is unstable then the controllability and observability quantities defined in ([2]) are undefined since 
the integrals will be unbounded. There may, however, still exist solutions to the Lyapunov equations Q 
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when A is unstable, and these solutions will be unique if and only if A (A) + X(A) ^ 0. In this case 
balancing may be carried out as usual by finding (if possible) a transformation T such that W c = W = £ 
where £ is again diagonal and positive semidefinite I1261 , IfTOl . Other approaches to balancing unstable 
linear systems exist (see 0, ll28l for the method of LQG balancing for example). 

Although several methods also exist for computing T lfT6l . ifTTll . it is common to simply compute the 
Cholesky decomposition of W Q so that W = ZZ T , and form the SVD UT, 2 U T of Z T W C Z. Then T is 
given by T = Y^U^Z~ X . We also note that the problem of finding the coordinate change T can be seen 
as an optimization problem JH of the form 

min trace [TW C T* + T^WoT- 1 }. 



B. Balancing of Nonlinear Systems 

In the nonlinear case, the energy functions L c and L Q in ([2]) and ([3]) are obtained by solving both a 
Lyapunov and a Hamilton-Jacobi equation. Here we follow the development of Scherpen Il24ll . Consider 
the nonlinear system 

" % = f{^) + Y7=x9i{x)u h 

y = h(x), 

with x G R n , u G E m , y G R p , /(0) = 0, &(0) = for 1 < i < m, and h(0) = 0. Moreover, assume the 
following Hypothesis. 

Assumption A: The linearization of (|4]) around the origin is controllable, observable and F = §f | a =o is 
asymptotically stable. 

Theorem 2.1: [|24| If the origin is an asymptotically stable equilibrium of f(x) on a neighborhood W 
of the origin, then for all x G W, L a (x) is the unique smooth solution of 

BL 1 

-^L(x)f(x) + -h T {x)h(x) = 0, L o (0) = (5) 

under the assumption that ^ has a smooth solution on W. Furthermore for all x G W, L c (x) is the 
unique smooth solution of 

^ {x)f{x) + 2-di {x)9{x)9T{x) ^ {x) = ^ Lc(0)= ° (6) 

under the assumption that ^ has a smooth solution L c on W and that the origin is an asymptotically 
stable equilibrium of —(/(#) + g(x)g T (x)^(x)) on W. 

With the controllability and the observability functions on hand, the input-normal/output-diagonal realiza- 
tion of system (|4]) can be computed by way of a coordinate transformation. More precisely, 

Theorem 2.2: [|24l Consider system under Assumption A and the assumptions in Theorem 2.1 Then, 
there exists a neighborhood W of the origin and coordinate transformation x = ip(z) on W converting 
the energy functions into the form 

LcOO)) = -z T z, 



1 - 

L (tp(z)) = - ^zfriizif 



where <Ji(x) > U2{x) > ■ ■ ■ > cr n (x). The functions <Tj(-) are called Hankel singular value functions. 
Analogous to the linear case, the system's states can be sorted in order of importance by sorting the 
singular value functions, and reduction proceeds by removing the least important states. 

In the above framework for balancing of nonlinear systems, one needs to solve (or numerically evaluate) 
the PDEs ([5]), ^ and compute the coordinate change x = f{z), however there are no systematic methods 
or tools for solving these problems. Various approximate solutions based on Taylor series expansions have 
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been proposed |[T3l . |[T2~1|. 0. Newman and Krishnaprasad 11201 introduce a statistical approximation based 
on exciting the system with white Gaussian noise and then computing the balancing transformation using 
an algorithm from differential topology. As mentioned earlier, an essentially linear empirical approach 
was proposed in lfl"5ll . In this paper, we combine aspects of both data-driven approaches and analytic 
approaches by carrying out balancing in a suitable RKHS. 

III. Empirical Balancing of Nonlinear Systems in RKHS 
We consider a general nonlinear system of the form 

J x = f(x,u) 

\ y = h(x) U) 

with x G W 1 , u e R m , y G R p , /(0,0) = 0, and h{0) = 0. Let K(x ) = {x' G R n : 3u G 
L 00 (IR,]R m ) and 3T G [0, oo) such that x(0) = x and x{T) = x'} be the reachable set from 
the initial condition x(0) = xq. 

Hypothesis H: The system ([7]) is zero-state observable, its linearization around the origin is controllable, 
and the origin of x — f(x, 0) is asymptotically stable. 

We treat the problem of estimating the observability and controllability Gramians as one of estimating 
an integral operator from data in a reproducing kernel Hilbert space (RKHS) [2]. Our approach hinges 
on the key modeling assumption that the nonlinear dynamical system is linear in an appropriate high 
(or possibly infinite) dimensional lifted feature space. Covariance operators in this feature space and their 
empirical estimates are the objects of primary importance and contain the information needed to perform 
model reduction. In particular, the (linear) observability and controllability Gramians are estimated and 
diagonalized in the feature space, but capture nonlinearities in the original state space. The reduction 
approach we propose adapts ideas from kernel PCA (KPCA) E51 and is driven by a set of simulated or 
sampled system trajectories, extending and generalizing the work of Moore Ifl9l and Lall et al. Ifl"5l . 

In the development below we lift state vectors of the system ([7]) into a reproducing kernel Hilbert 
space fl2j . 

A. Elements of Learning Theory 

In this section, we give a brief overview of reproducing kernel Hilbert spaces as used in statistical 
learning theory. The discussion here borrows heavily from flU, E3- Early work developing the theory of 
RKHS was undertaken by N. Aronszajn [0. 

Definition 3.1: Let H be a Hilbert space of functions on a set X. Denote by (/, g) the inner product 
on % and let 1 1/| | = (/, /) 1//2 be the norm in H, for / and g £ Ti. We say that H is a reproducing kernel 
Hilbert space (RKHS) if there exists K : X x X ->■ R such that 

i. K has the reproducing property, i.e. V/ G H, f(x) = (f{-),K(-,x)). 

ii. K spans H, i.e. H = span{K(x, -)\x G X}. 

K will be called a reproducing kernel of %. Hk(X) will denote the RKHS % with reproducing kernel 
K, 

Definition 3.2: (Mercer kernel map) A function K : X x X — > R is called a Mercer kernel if it is 
continuous, symmetric and positive definite. 

The important properties of reproducing kernels are summarized in the following proposition 
Proposition 3.1: If K is a reproducing kernel of a Hilbert space H, then 

i. K(x,y) is unique. 

ii. Vx,y G X, K(x,y) = K(y,x) (symmetry). 

iii. YlTj=i otiOLjK(xi,Xj) > for a, ; G R and Xj G X (positive deflnitness). 

iv. (K{x,-),K(y,-)) H = K(x,y). 
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v. Let c ^ 0. The following kernels, defined on a compact domain X C W l , are Mercer kernels: 

\\x-y\\ 2 

K(x,y) = x-y' (Linear), K(x,y) = (1+x-y) , d E IN (Polynomial), K(x, y) = e ~^ , a > 
(Gaussian) . 

Theorem 3.1: Let i^i^x^-^lbea symmetric and positive definite function. Then, there exists a 
Hilbert space of functions W. defined on X admitting K as a reproducing Kernel. 
Conversely, let H be a Hilbert space of functions / : X — > R satisfying 

VxeX,B Kx >0, such that < K x \\f\\ H , V/ E 7i. 

Then, % has a reproducing kernel K. 

Theorem 3.2: Every sequence of functions (/„)„>i which converges strongly to a function / in Hk(X), 
converges also in the pointwise sense, that is, lim n ^oo f n (x) = f(x), for any point x E X. Further, this 
convergence is uniform on every subset of X on which x h-> K(x, x) is bounded. 

Theorem 3.3: Let K(x, y) be a positive definite kernel on a compact domain or a manifold X. Then 
there exists a Hilbert space T and a function $ : X — > T such that 

= for x,yeX. 

$ is called a feature map, and J 7 a feature spaceQ 
Remarks. 



i. In theorem 3.3, and using property [iv.] in Proposition 3.1 we can take := K x := K(x, ■) in 
which case T = H - the "feature space" is the RKHS. 

ii. The fact that Mercer kernels are positive definite and symmetric reminds us of similar properties of 
Gramians and covariance matrices. This is an essential fact that we are going to use in the following. 

iii. In practice, we choose a Mercer kernel, such as the ones in [v.] in Proposition 3.1[ and theorem 3.1 



guarantees the existence of a Hilbert space admitting such a function as a reproducing kernel. 

< 

RKHS play an important in learning theory whose objective is to find an unknown function / : X — > Y 
from random samples (xi, 2/*) I^Li- F° r instance, assume that the random probability measure that governs 
the random samples is p and is defined on Z := X x Y. Let X be a compact subset of R n and Y = R. 
If we define the least square error of / as 

S = [ (f(x)~y) 2 dp, (8) 

JXxY 

then the function that minimzes the error is the regression function f p 

f P ( x ) = / ydp{y\x), x e x, 
Jr 

where p(y\x) is the conditional probability measure on R. Since p is unknown, neither f p nor £ is 
computable. We only have the samples s := (x^yi)]^. The error f p is approximated by the empirical 
error £ s (f) by 

£-(/) = -E^-^ + AII/ll^, (9) 

i=l 

for A > 0, A plays the role of a regularizing parameter. In learning theory, the minimization is taken over 
functions from a hypothesis space often taken to be a ball of a RKHS Hk associated to Mercer kernel 
K, and the function / s that minimizes the empirical error £ s is 

m 

f s = Y,c 3 K(x,x,), (10) 



'The dimension of the feature space can be infinite and corresponds to the dimension of the eigenspace of the integral operator Lk 
C 2 V {X) -t C(X) defined as {L K f){x) = J K(x,t)f(t)dv{t) if K is a Mercer kernel, for / G C? V (X) and v is a Borel measure on X. 
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where the coefficients (cj) 1 Jl =1 is solved by the linear system 

m 

\mc i + S ^2K(x il x j )cj = yi, i = l,---m, (11) 
and / s is taken as an approximation of the regression function f p . Hence, minimizing over the (possibly 



infinite dimensional) Hilbert space, reduces to minimizing over IR m . The series (10) converges absolutely 
and uniformly to /. 

We call learning the process of approximating the unknown function / from random samples on Z. 
In the following, we assume that the kernels K are continuous and bounded by 

K = sup a/ K(x, x) < oo . 

B. Empirical Gramians in RKHS 

Following 030, we estimate the controllability Gramian by exciting each coordinate of the input with 
impulses while setting x = 0. One can also further excite using rotations of impulses as suggested 
in 031, however for simplicity we consider only the original signals proposed in 030. Let u l (t) = 5(t)ei 
be the i-th excitation signal, and let x l {t) be the corresponding response of the system. Form the matrix 
X(t) = [x x (t) • • • x m (t)~\ E M nxm , so that X(t) is seen as a data matrix with column observations given 
by the respective responses x l (t). Then W c E M. nxn is given by 

1 f°° 

W c = — X(t)X(t) r dt. (12) 
™ Jo 

We can approximate this integral by sampling the matrix function X{t) within a finite time interval [0, T] 
assuming the regular partition {ti]f =l ,ti = (T/N)i. This leads to the empirical controllability Gramian 

rp N 

w *=^Y, x M x <ti T ' (13) 

i=i 

As described in 030, the observability Gramian is estimated by fixing u{t) = 0, setting x = e% for 
i = l,...,n, and measuring the corresponding system output responses y l (t). As before, assemble the 
responses into a matrix Y(t) = \y l (t) ■ ■ ■ y n (t)] e W xn . The observability Gramian W Q E IR nxri and its 
empirical counterpart W are given by 

1 f°° 

W = - I Y(t) T Y(t)dt (14) 

and 



PJo 



T A 



where Y(t) = Y(t) T . The matrix Y(U) E IR nxp can be thought of as a data matrix with column observations 

d j (t i )={y](t i ),...,y?(t i )) T eR n , j = 1, . . . ,p, i = 1, . . . , N (16) 

so that dj(U) corresponds to the response at time U of the single output coordinate j to each of the 
(separate) initial conditions x — e k , k — 1, . . . , n. This convention will lead to greater clarity in the steps 
that follow. 
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C. Kernel PCA 

Kernel PCA ll25l will be a helpful starting point for understanding the approach to balanced reduction 
introduced in this paper. We briefly review the relevant background here. Kernel PCA (KPCA) generalizes 
linear PCA by carrying out PCA in a high dimensional feature space defined by a feature map $ : IR n — > T . 
Taking the feature map = K x and given the set of data x := {xi]f =1 G W 1 , we can consider PCA 
in the feature space by simply working with the covariance of the mapped vectors, 

1 - 

C x = -2^$(^)®$(^), (17) 

i=l 

where $(xj) <8> = ($(xj), •) denotes the tensor product between two vectors in H. We will 

assume the data are centered in the feature space so that J2i^( x i) = 0- ^ not ' data ma y be centered 
according to the prescription in ll25ll . 

The principal subspaces are computed by diagonalizing C x , however as is shown in [|25l . one can 
equivalently form the matrix K e Hl NxN of kernel products (K)^ = K{x^ xj) for i, j = 1, . . . , N, and 
solve the eigenproblem 

Kol = NXa. (18) 



If- 
then we have that 



C x Vi = XiVi, (19) 



(20) 



where ^ := (&(xi) • • • Q(xn)), and the non-zero eigenvalues of K and C x coincide. 

The eigenvectors oti of are then normalized so that the eigenvectors Vi of C x have unit norm in the 

1 1 2 1 

feature space, leading to the condition ||otj|| = A 4 . Assuming this normalization convention, sort the 
eigenvectors according to the magnitudes of the corresponding eigenvalues in descending order, and form 
the matrix 

A q = [oti ■■■ a q ] , 1 < q < min(n, N). (21) 

Similarly, form the matrix V q = [v± ■ ■ ■ v 9 ] , 1 < q < n of sorted eigenvectors of C x . The first q 
principal components of a vector x = in the feature space are then given by V q T x. It can be shown 
however (see [|25l ) that principal components in the feature space can be computed in the original space 
with kernels using the map 

U(x) := A T q k(x), (22) 

where k(x) = (K(x, xi), . . . , K(x, xn)) T - 



D. Model Order Reduction Map 

The method we propose consists, in essence, of collecting samples and then performing a process 
similar to "simultaneous principal components analysis" on the controllability and observability Gramian 
estimates in the (same) RKHS. As mentioned above, given a choice of the kernel K defining a RKHS H, 
principal components in the feature space can be computed implicitly in the original input space using K. 
It is worth emphasizing however that we will be co-diagonalizing two Gramians in the feature space by 
way of a non-orthogonal transformation; the process bears a resemblance to (K)PCA, and yet is distinct. 
Indeed the favorable properties associated with an orthonormal basis are no longer available, the quantities 
we will in practice diagonalize are different, and the issue of data-centering must be considered with some 
additional care. 
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First note that the controllability Gramian W c can be viewed as the sample covariance of a collection 
of N • m vectors, scaled by T 

rp N N m 

«=i j=i j=i 

and the observability Gramian can be similarly viewed as the sample covariance of a collection of N ■ p 
vectors 

rp N V 

P i=l j=l 



where the dj are defined in Equation ( fl6| ). 

We can thus consider three quantities of interest: 

• The controllability kernel matrix K c G ]^ NmxNm of kernel products 

= K( Xfl , x v ) = ($«), <$>(x v )) T (25) 

for fx, v — 1, . . . , iVm where we have re-indexed the set of vectors {x^(ti)}ij = {x^}^ to use a single 
linear index. 

• The observability kernel matrix K Q G ^ n p xN p^ 

(Ko)^ = K(d„ d v ) = , ®{d v )) T (26) 

for /j,, v — 1, . . . , Np, where we have again re-indexed the set {dj(U)} it j = {d^}^ for simplicity. 
. The Hankel kernel matrix K Qfi G R N P* Nm t 

(K 0>e )^ = K(d„ x u ) = ($(d M ), $M) T (27) 

for /i = 1, . . . , iVp, u = 1, . . . , Nm. 
We have chosen the suggestive terminology "Hankel kernel matrix" above because the square-roots of 
the nonzero eigenvalues of the matrix K 0fi Kj c are the empirical Hankel singular values of the system 
mapped into feature space, where we assume the system behaves linearly. This assertion will be proved 
immediately below. Note that ordinarily, Nm, Np 3> n and K c , K Q will be rank deficient. 

Before proceeding we consider the issue of data centering in feature space. PCA and kernel PCA 
assume that the data have been centered in order to make the problem translation invariant. In the setting 
considered here, we have two distinct sets of data: the observability samples and the controllability 
samples. A reasonable centering convention centers the data in each of these datasets separately. Let ^ 
denote the matrix whose columns are the observability samples mapped into feature space by $, and 
let $ be the matrix similarly built from the feature space representation of the controllability samples. 
Then K Q = \I> T \I> , K c = <fr T <fr and K 0jC = ^ T $. Assume for the moment that there are M observability 
data samples and N controllability samples, and let Ijv, 1m denote the length N, M vectors of all ones, 
respectively. We can define centered versions of the feature space data matrices <fr, ^ as 

$ = $-fi c ll, * = *-/Um (28) 

where pL c := N -1 $1jv and fi Q := M^ 1 ^1m- We will need two centered quantities in the development 
below. The first centered quantity we consider is the centered version of K oc , namely K o c = ^ <fr. 
Although one cannot compute \x c , fM explicitly from the data, we can compute K 0>c by observing that 

K DiC = (* - /i l T M ) T (® - n e lJr) 

= K oc — -\jK Qfi l N lli — j-fl M l M K 0>c + j^1m1mKo,c1n1~n- (29) 
The second quantity we'll need a centered version of the empirical observability feature map 

k (x) := * T $(x) = (K(x, d t ), K(x, d M )) T (30) 
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where x E W 1 is the state variable and the observability samples {dj} are again indexed by a single 
variable as in Equation ([26]). Centering follows reasoning similar to that of the Hankel kernel matrix 
immediately above: 

k a (x) = (* - ^ D l^) T ($(x) - /i c ) 

= k Q (x) - jjK 0tC l N - il M l^-k (a;) + -^l^l^i^l/v. (31) 

Note: Throughout the remainder of this paper we will drop the special notation K 0iC ,k (x) and assume 
that K o c , k (x) are centered appropriately. 

With the quantities defined above, we can co-diagonalize the empirical Gramians (balancing) and reduce 
the dimensionality of the state variable (truncation) in feature space by carrying out calculations in the 
original data space. As the system is assumed to behave linearly in the feature space, the order of the 
model can be reduced by discarding small Hankel values {T,n}™ =q+1 , and projecting onto the subspace 
associated with the first q < n largest eigenvalues. The following key result describes this process: 

Theorem 3.4 (Balanced Reduction in Feature Space): Balanced reduction in feature space can be ac- 
complished by applying the state- space reduction map II : W 1 — > M. q given by 

U(x) = Tjk (x), xGl" (32) 

where T q = V q E q 1/2 if K 0>c Kj c = VY, 2 V T , and k ( x) is the empirical observability feature map. 

Proof: We assume the data have been centered in feature space. Let $ be a matrix with columns 
{^(^ (xj))}, i — 1, . . . , N, j — 1, . . . , m, so that X = <fr<fr T is the feature space controllability Gramian 
counterpart to Equation (|23). Similarly, let * be a matrix with columns {&(dj(ti))} , i = 1, . . . ,N,j = 



l,...,p, so that Y = is the feature space observability Gramian counterpart to Equation ( |24| ). 

Since by definition K{x,y) = (&(x), ^{y))^, we also have that K c = <fr T <3? and K Q = \t ,T \t r . In general 
the Gramians X, Y are infinite dimensional whereas the kernel matrices K c , K Q are necessarily of finite 
dimension. 

We now carry out linear balancing on (X, Y) in the feature space (RKHS). First, take the SVD of 
X 1 / 2 * so that 

UJ:V T = X 1/2 * (33) 
UT, 2 U T = (X 1/2 *)(X 1/2 *) T = X 1/2 YX 1/2 (34) 
VY?V T = (X 1/2 *) T (X 1/2 *) = * T $$ T * = K 0>c Kj ;C . (35) 

The last equality in Equation ( f34| ) follows since X is symmetric and therefore X 1 / 2 is too. The lin- 
ear balancing transformation is then given by M = T I 1 ^ 2 U T X^ 1 ^ 2 , and one can readily verify that 
MXM T = M~ T Y M~ l = S. Here, inverses should be interpreted as peudo-inverses when appropriate)^] 
From Equations @-([35]), we see that U T = E^V^^X 1 / 2 and thus M = T,^/ 2 V T ^ T . We can project 



an arbitrary mapped data point onto the (balanced) "principal" subspace of dimension q spanned by 
the first q rows of M by computing 

M q $(x) = ^ q 1/2 V q T ^(x) = T,- x ' 2 VjK(x) (36) 

where k c (x) := ^ T $(x) is the empirical observability feature map, recalling that V q is the matrix formed 
by taking the top q eigenvectors of K 0:C Kj c by Equation ( |35] ). 



We note that square roots of the non-zero eigenvalues of K QtC Kj c are exactly the Hankel singular values of 
the system mapped into the feature space, under the assumption of linearity in the feature space. This can 
be seen by noting that \+(YX) = \ + (X 1/2 YX 1/2 ) = A + (X 0iC X^ c ), where A + (-) refers to the non-zero 
eigenvalues of its argument. 



In Section IV below we show how to use the nonlinear reduction map (32) to realize a closed, reduced 



order system which can approximate the original system to a high degree of accuracy. 

2 Such as in the case of X _1//2 when the number of data points is less than the dimension of the RKHS. 
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IV. Closed Dynamics of the Reduced System 

Given the nonlinear state space reduction map II : IR ra — >■ W, a remaining challenge is to construct 
a corresponding (reduced) dynamical system on the reduced state space which well approximates the 
input-output behavior of the original system on the original state space. Setting x r = TL(x) and applying 
the chain rule, 

x r = (J u {x)f{x,u))\ x=W{xr) (37) 

where W refers to an appropriate notion (to be defined) of the inverse of II. However we are faced 
with the difficulty that the map II is not in general injective (even if q = n), and moreover one cannot 
guarantee that an arbitrary point in the RKHS has a non-empty preimage under $ [TT8l . We propose an 
approximation scheme to get around this difficulty: The dynamics / will be approximated by an element 
of an RKHS defined on the reduced state space. When / is assumed to be known explicitly it can be 
approximated to a high degree of accuracy. An approximate, least-squares notion of W will be given to 
first or second order via a Taylor series expansion, but only where it is strictly needed - and at the last 
possible moment - so that a first or second order approximation will not be as crude as one might suppose. 
We will also consider, as an alternative, a direct approximation of J n (n^(x r )) which takes into account 
further properties of the reproducing kernel as well as the fact that the Jacobian is to be evaluated at 
x = W(x r ) in particular. In both cases, the important ability of the map LI to capture strong nonlinearities 
will not be significantly diminished. 

A. Representation of the dynamics in RKHS 

The vector-valued map / : IR n x IR m — » IR n can be approximated by a composing a set of n regression 
functions (one for each coordinate) f : W xm — > R in an RKHS, with the reduction map IT. It is reasonable 
to expect that this approximation will be better than directly computing f(W(x r ),u) using, for instance, a 
Taylor expansion approximation for W which may ignore important nonlinearities at a stage where crude 
approximations must be avoided. 

Let x = II(x) denote a reduced state variable, and concatenate the input examples Xj = H{xj) E 
W,Uj E M m so that Zj = (xj,Uj) E W xm , and {{fi{xj,Uj),Zj)}j =1 is a set of input-output training pairs 
describing the i-th coordinate of the map (x,u) !->■ f(x,u). The training examples should characterize 
"typical" behaviors of the system, and can even re-use those trajectories simulated in response to impulses 
for estimating the Gramians above. We will seek the function f E H which minimizes 

t 

^2{M z j) - fi(zj,Uj)) + M\M\n 

where Aj here is a regularization parameter. We have chosen the square loss, however other suitable loss 
functions may be used. It can be shown [27] that in this case fi takes the form f(z) = Yl]=i c )K^{ z i z j)^ = 
1, . . . , n, where defines the RKHS %f (and is unrelated to K used to estimate the Gramians). Note that 
although our notation takes the RKHS for each coordinate function to be the same, in general this need 
not be true: different kernels may be chosen for each function. Here the {c* } comprise a set of coefficients 
learned using the regularized least squares (RLS) algorithm. The kernel family and any hyper-parameters 
can be chosen by cross-validation. For notational convenience we will further define the vector-valued 
empirical feature map 

for i = 1, ...,£. In this notation f(Tl(x),u) = cjk^(x,u) where (c»)j = A. 

A broad class of systems seen in the literature [24j are also characterized by separable dynamics of the 
form x = f(x) + Y^iLi 9i( x ) u i- m this case one need only estimate the functions / and g, t from examples 

{(nfoOi/fri))}*- and {( n (zj),£(zi))}j- 
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B. Approximation of the Jacobian Contribution 

We turn to approximating the component Jn(W(x r )^ appearing in Equation ( |37] ). 

1 ) Inverse-Taylor Expansion: A simple solution is to compute a low-order Taylor expansion of II and 
then invert it using the Moore-Penrose pseudoinverse to obtain the approximation. For example, consider 
the first order expansion IL(x) ~ U(a) + Ju(a)(x — a). Then we can approximate lP(x r ) (in the first-order, 
least-norm sense) as 

nt(av) := (J u (a))\x r - 11(a)) + a. (38) 

We may start with a = xq, but periodically update the expansion in different regions of the dynamics 
if desired. A good expansion point could be the estimated preimage of x r (t) returned by the algorithm 
proposed in lfl4l . If k D (x) is the centered version of the length M vector k (x) defined by < |30] ) , then 

where 1 M is the length M vector of all ones. An example calculation of (d x \s. Q {x)) i = d x K(x, dj) in the 
case of a polynomial kernel is given in the section immediately below. 

2) Exploiting Kernel Properties: For certain choices of the kernel K defining the Gramian feature 
space H, one can exploit the fact that K x and its derivative bear a special relationship, and potentially 
improve the estimate for Jn(II^(x r )). Perhaps the most commonly used off-the-shelf kernel families are 
the polynomial and Gaussian families. For any two kernels with hyperparameters p and q (respectively) 
in one of these classes, we have that K v = (K q ) p l q . We'll consider the polynomial kernel of degree d, 
K d (x,y) := (1 + (x,y)) d in particular; the Gaussian case can be derived using similar reasoning. For a 
polynomial kernel we have that 

dK ff ] = dK d ^(x,y)y T =d{K d (x,y)) d -^y T . 

Recalling that K d (x,y) = (&(x),$(y)) n and x r = U(x) = VjQ(x), if II were invertible then we would 
have 



dK d (x,y) 



dx 



d-l 



= d(($on- 1 )(x r ),$( !/ )) d y T . 

=n-i(av) 



The map II is not injective however, and in addition the fibers of $ may be potentially empty, so we 
must settle for an approximation. It is reasonable then to define o XT') (x r ) as the solution to the convex 
optimization problem 

min \\z\\«j 

x&i ( 39) 
subj. to \\M q z — x r \\ Rk = 

where M q : H — > M fc is defined as in Equation ([36]). If a point z E % has a pre-image in IR n this definition 



is consistent with composing $ with the formal definition ^ 1 (z) = {x G 1" | = z] and noting 

that in this case II o = M 9 ($ o $ _1 ) = M q z. Furthermore, a trajectory x r (t) of the closed dynamical 
system on the reduced statespace need not (and may not) have a counterpart in the original statespace by 
virtue of the way in which IT is used in our formulation of the reduction map and corresponding reduced 
dynamical system. 



One will recognize that the solution z* to d39l) is just the Moore-Penrose pseudoinverse z* = Mj 



Inserting this solution into the feature map representation of a kernel K gives the following definition for 
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K{W(x r ),y): 

= (Mlx r ^(y)) n = (x r ,(Mj)^(y)\ 
= (x r ,(M q M q T )- 1 M q <S>(y)) 
= (x r ,(M q Mj)- 1 Tl(y)) 
= (x r ,(TjK T q )- 1 U(y)) 



where the final equality follows applying Equations ([33])-([35]) and T q is defined as in Theorem 3.4 
Substituting into the derivative for a polynomial kernel K = K d gives 



dK d (x, y) 



dx 



= d(x r ,(T q T K T q )- 1 U(y)) d ^y T 

~-W(x r ) 



which immediately gives an expression for Jji(W(x r )). Note that this approximation is global in the sense 
that the q x q matrix inverse (TjKoTq)^ 1 need only be computed oncepl no updating is required during 
simulation of the closed system. 

C. Reduced System Dynamics 

Given an estimate f(ll(x),u) of f(x,u) in the RKHS %f and a notion of J n (n^(x r )) from above, we 
can write down a closed dynamical system on the reduced statespace. We have 

x r « (j u (x)f(U(x),u)) 

x=W(x r ) 

~ {Ju(x))\ x=nHxr) C T y(x r ,u) 

&Tjj k (tf{x r ))C T y(x r ,u) (40) 

where C is a matrix with the vectors Cj as its rows, and ,\ is the Jacobian of the empirical feature map 
defined in Equation ([30]). Here the expression Jk(n^(x r )) should be interpreted as notation for either of 



the Jacobian approximations suggested in Section [IV-B[ 

Equation < |40"1 ) is seen to give a closed nonlinear control system expressed solely in terms of the reduced 
variable x r E W: 

x r = T g T J k (n t (x r ))C T k / (x r ,n) 
y = h(x r ) 

where the map lioll modeling the output function h : IR n — > W is estimated as described immediately 
below. Although the "true" reduced system does not actually exist due to non-injectivity of the feature 
map $, in many situations one can expect that the above system will capture the essential input-output 
behavior of the original system. We leave a precise analysis of the error in the approximations appearing 



in (40) to future work. 



D. Outputs of the Reduced System 

Analogous to the case of the dynamics /, we are faced with two possibilities for approximating y = 
h(W(x r )). We c an app ly a crude Taylor series approximation to estimate 17^ and therefore h(W(x r )), 



or as in Section 



IV-A | we can estimate a map (h o IT) : R n —> W, x r h-> y from the reduced state 



space to the output space directly, using RKHS methods. Given samples {H(xj),yj}j =1 , each coordinate 



We use the word "inverse" loosely. In practice one would use a numerically stable method, such as an LU-factorization, which can be 
used to rapidly compute A~ 1 b for fixed A but many different b. 
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function (hi) P ._ 1 is given in the familiar form hi(H(x)) = X^=i bjK h (H.(x), H(xj)), where K h is the 
kernel chosen to define the RKHS, and may be different for each coordinate. 

It should be noted that just given the state space reduction map II, one can immediately compare the 
output of the system defined by h(x r ) to the original system without defining a closed dynamics as above. 
In fact with II and h one can design a simpler controller which takes as input the reduced state variable 
x r , but controls the original system. 



E. Structural Properties of the Reduced Order System 

In this section, we show that a linearization of the reduced order system preserves the important structural 
properties of a linearization of the full order system. We leave the study of the structural properties of 
the reduced nonlinear system for future work. 

We first note that the approach introduced in this paper reduces to Moore's approach [fT9l if the system 
is linear and one adopts the linear kernel K(x,y) = (x,y). For instance, consider the linear control 
system ([I]). If we take $(x) = x, then the empirical controllability and observability Gramians in feature 



space are exactly the Gramians (|23])-(|24[), as introduced in Moore's work IU91 page 21]. The feature space 
is the original data space, so balancing in the RKHS as explained above reduces to Moore's notion of 
balancing. In this case one obtains reduced order system dynamics via a Galerkin projection rather than 
by attempting to statistically estimate the (linear) dynamics and output functions^} 

The following brief Proposition shows that the reduced order system obtained using the methods 
proposed here preserves important properties of the original system. 

Proposition 4.1: If the linearization of the full order system ([7]) is controllable/observable/Hurwitz then 



the linearization of the reduced order system (41 ) is controllable/observable/Hurwitz 



Proof: From (37) and (40), the dynamics of the reduced order system is given by 

Xr = {Ju{x)f{x,u))\ x=W{xr) , 

w (Jn(x)/(n(x),«)) 

x=W (x r ) 

« (Mx))\ x=nHxr) c T y(x r ,u) 

« T g T J k (n t (x r ))C T k / (x r ,M) 

= A r x r + B r u + 0(x r , u) 2 , (42) 



and the output is given by 



y = h(x r ) = C r x r + 0(x r y 



where (A r , B r , C r ) G W xq x ]R mX| 3 x W xp . Because our approach reduces to Moore's approach in the 
linear case, then using Moore's results in |[T9l we may conclude that if the linearization of the full order 
system is asymptotically stable, controllable, and observable (which is the case under assumption H), then 
A r is Hurwitz, (A r , B r ) is controllable and (A r ,C r ) is observable. ■ 



F. Algorithm Summary 

To summarize, the approach we have proposed proceeds as follows 

1) Given a nonlinear control system Q, let u l {t) = 5{t)ei be the i-th excitation signal for i = 1, . . . , m, 
and let x l (t) : t e [0, oo) i-> x l (t) G W 1 be the corresponding response of the system. Run the system 
and sample the trajectories at times {tj}^ =1 to generate a collection of N -m vectors {x % (tj) G W 1 }. 

4 That is, whenever the system is known. If not, one will still of course need to estimate the system statistically by sampling trajectories 
and applying the procedure outlined above. 
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2) Fixing u{t) = and setting x = e« for z = 1, . . . , n (separately), measure the corresponding system 
output responses y l (t) : t G [0, oo) i-)- G M p . As before, sample the responses at times {tj}^ =1 
and save the collection of N ■ p vectors {<i fe (tj)} defined as 

dkit j ) = (y 1 k (t j ),...,y%(t j )) T <=R n , k = 1, . . . ,p, j = 1, . . . , N (43) 

3) Choose a kernel K defining a RKHS H, and form the Hankel kernel matrix K 0>c G R N P* Nm , 

{K ,c)^ = K(dp, x v ) fi = l,...,Np, v = l,...,Nm (44) 

where we have re-indexed the sets {dk(U)} = {d^}, {x l (tj)} = {x u } to use single indices. 

4) Compute the eigendecomposition K 0tC Kj c = VT?V T assuming K Qfi has been centered according 
to Equation ( f29| ). 

5) The order of the model is reduced by discarding small eigenvalues {S^}" +1 , and projecting onto 
the subspace associated with the first q < n largest eigenvalues. This leads to the state-space 
reduction map II : R n — > W given by 

U(x) =Tjk (x), i£l" (45) 

— 1/2 

where T q = VgT.q ' and k Q (x) is the centered empirical observability feature map given by 
Equation pi) . 

6) From input/output pairs or simulated/measured trajectories, learn approximations of the dynamics 
and output function defined on the reduced state space using, for instance, the method described 
in section §III-A| (equations {fT0])-(fTT|). The RKHS used to approximate these functions need not be 



the same as the RKHS in which balanced truncation was carried out. 



7) Approximate the Jacobian contribution as described in section ^IV-B 



8) Combine the approximations to determine an expression for a closed, reduced, nonlinear dynamical 
system as described in sections 



IV-C and 8. IV-D 



V. Experiments 

We demonstrate an application of our method on two examples appearing in ||2T1 (Examples 3.1. and 
3.2, pgs. 52-54). 



A. Model Systems 

1 ) Two-Dimensional Exactly Reducible System: Consider the nonlinear system 

x 1 = 

y = 2x 1 -x 2 . 

It can be shown that this system has the same input-output relationship as the system y = —y 3 



X l X2 + 2x\x 2 — x 2 
2x\ — \Qx\x2 + lCteiX^ — — u 



u by 



-(2x1 - £2) 2 £i + (xi - x 2 ) 



x 2 ) — u 



rearranging terms so that 

Xi = 

x 2 = —{2xi - x 2 ) z x 2 + 2(x 1 
y = 2xi-x 2 . 

Defining the new variables Z\ = 2x\ — X2 and z 2 = x\ — x 2 , the system can then be re-written 

z 2 

V = Zi. 

It can be seen that the variable z 2 may be truncated because it doesn't appear in the expression of the 
output and thus doesn't affect Z\, 



-zf + u 
-z\z 2 



z 2 +u 
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Control Input 




0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 

Time (s) 
Control Input 
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Time (s) 

Original and Reduced System Responses 




J t- 



0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 

Time (s) 



Fig. 1. (Top) Simulated output trajectories for the original and reduced 2-dimensional system. (Bottom) Simulated output trajectories for 
the original and reduced 7 -dimensional system. 
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Fig. 2. Largest Hankel singular values of the 2- and 7 -dimensional systems in feature space, plotted in descending order. Note that the 
ordinate axis follows a logarithmic scale. 



2) Seven-Dimensional System: We will also consider a 7-dimensional nonlinear system with one 
dimensional input and output: 

X\ = —x\ + u x 2 — —x\ — x\x 2 + 3xixl — u 

X3 = —0:3 + £5 + U £4 = — x\ + X\ — X2 + X3 + 2u 

X5 = X1X2X3 — x\ + U Xq = X5 — Xg — £5 + 2u 

x 7 = — 2xg + 2x 5 — x 7 — x\ + 4m 
y = X\ — x\ + £3 + X4X3 + x 5 — 2x 6 + 2x 7 



B. Experimental Setup 

For both systems impulse and initial-condition responses of the system were simulated as described 
above, and 800 samples equally spaced in the time interval [0, 5s] were sampled to build the Hankel kernel 
matrix K oc given by the third degree polynomial kernel K(x,y) = (1 + (x,y)) 3 . For the 2-D system we 
retained one component, and for the 7-D system we retained two for the sake of variety. Thus the reduction 
map II was defined by taking the top one or two eigenvectors (scaled columns of T) corresponding to 
the largest Hankel singular values, giving a reduced state space of dimension one or two for the 2-D and 
7-D systems, respectively. 



Next, a map from the reduced variable x r to x was estimated following Section IV-A The same 



procedure was followed in both experiments. The control input was chosen to be a lOhz square wave 
with peaks at ±1 at 50% duty cycle, and 1000 samples from the simulated system in the interval [0,5s] 
were mapped down using II and then used to solve the RLS regression problems, one for each state 
variable, again using a third degree polynomial kernel. All initial conditions were set to zero. The desired 
outputs (dependent variable examples) used to learn / were taken to be the true function / evaluated 
at the samples from the simulated state trajectory. We also added a bias dimension of l's to the data 



18 



to account for an offset, and used a fast leave-one-out cross-validation (LOOCV) computation E3ll to 
select the optimal regularization parameter. Two remarks are in order. The above dynamics can in fact 
be represented explicitly and exactly in a 3rd degree polynomial RKHS; only monomials up to degree 3 
appear in the dynamics. Second, the control input is decoupled from the state. Both of these facts can be 
used to obtain an improved reduced model, however we did not make use of these special properties and 
instead applied the simplest version of the techniques described above which assume no special structure. 

We followed a similar process to learn the output function y = h(x r ) for both systems. Here we used 
a 10Hz square wave control input (peaks at ±2, 50% duty cycle), zero initial conditions and 700 samples 
in the interval [0,5s]. For this function the Gaussian kernel K(x,y) = exp(— j\\x — y\\l) was used to 
demonstrate that our method does not rely on any particular match between the form of the dynamics 
and the type of kernel. The scale hyperparameter 7 was chosen to be the reciprocal of the average 
squared-distance between the training examples. We again used LOOCV to select the RLS regularization 
parameter. 

Finally, closed systems were simulated as described above using xo = and a control input different 
from those used to learn the dynamics and output functions: u(t) = |(sin(27r3t) + sq(27r5t — 7r/2)) where 
sq(-) denotes the square wave function. This input is shown at the top of both simulation summaries in 
Figure [T] The Taylor series approximation for II was done once, about xq, and was not updated further. 

C. Results 

Figure [2] shows the top seven Hankel singular values in the feature space for the two problems on a log 
scale. One can see that, for both systems, even a single component ought to capture most of the system's 
behavior. Further simulations of the 7-D system with a single component showed only a small amount 
of additional error beyond that the two component system, as one would expect from the decay of that 
system's Hankel values. 

The simulated outputs y(t) of the closed reduced systems as well as the output y(t) of the original system 
are plotted in Figure [T] (left, 2-D system ; right, 7-D system). One can see that, even for a significantly 
different input, the reduced systems closely capture the original systems. The main source of error is 
seen to be over- and under- shoot near the square wave transients. This error can be further reduced by 
simulating the system for different sorts of inputs (and/or frequencies) and including the collected samples 
in the training sets used to learn II, / and h. Indeed, we have had some success driving example systems 
with random uniform input in some cases. 

Finally, for illustrative purposes we show examples of the controllability and observability kernel 
matrices K c and K Q for the 7-D system in Figure [3j 

VI. Conclusion 

We have introduced a new, empirical model reduction method for nonlinear control systems. The method 
assumes that the nonlinear system is approximately linear in a high dimensional feature space, and carries 
out linear balanced truncation in that space. This leads to a nonlinear reduction map, which we suggest 
can be combined with representations of the dynamics and output functions by elements of an RKHS 
to give a closed reduced order dynamical system which captures the input-output characteristics of the 
original system. We then demonstrated an application of our technique to a pair of nonlinear systems 
and simulated the original and reduced models for comparison, showing that the approach proposed here 
can yield good low-order nonlinear reductions of strongly nonlinear control systems. We believe that 
techniques well known to the machine learning and statistics communities can offer much to control and 
dynamical systems research, and many further directions remain, including computing error estimates, 
reduction of unstable systems, structure preserving systems, stochastically perturbed systems, and finding 
easily verifiable conditions of model reducibility of nonlinear systems. 
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